Embedding Models: from Architecture to Implementation

Welcome to Embedding Models: from Architecture to Implementation.

Built in partnership with Vectara

You may have heard of embedding vectors being used in generative AI applications. These vectors have an amazing ability to capture the meaning of a word or phrase.

Introduction to Embedding Models

In this lesson, you will learn:

Vector Embeddings

Vector embeddings map real-world entities, such as a word, sentence, or image, to vector representations: points in some vector space.

A key property is that points that lie close to each other in the vector space correspond to entities with similar semantic meaning.
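As a quick, hedged illustration of "nearby points mean similar meaning," the toy sketch below compares hand-made vectors with cosine similarity; the numbers are invented for illustration and are not real embeddings.

import numpy as np

# Toy illustration: cosine similarity as the notion of "closeness" in vector space.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([0.9, 0.1, 0.3])        # invented vectors, not real embeddings
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, kitten))  # high score: semantically close
print(cosine_similarity(cat, car))     # lower score: semantically distant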

Word Embeddings

Word2Vec was the pioneering work on learning token or word embeddings that maintain semantic meaning.

These word embedding vectors behave like vectors in a vector space, allowing algebraic operations:

queen - woman + man ≈ king

Example from Star Wars text:

Yoda - good + evil ≈ Vader
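The sketch below reproduces this kind of analogy arithmetic with gensim's pretrained Word2Vec vectors; the dataset name is gensim's public download (not something provided by the course) and fetching it requires a sizable one-time download.

import gensim.downloader as api

# Sketch: word-vector analogies with pretrained Word2Vec (Google News vectors).
wv = api.load("word2vec-google-news-300")

# queen - woman + man  ≈  king
print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))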

A sentence embedding model applies the same principle to complete sentences, converting a sentence into a vector that represents its semantic meaning.
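As a hedged sketch of what a sentence embedding model does in practice, the example below uses the sentence-transformers library; the checkpoint name is an illustrative public model, not necessarily the one used in this course.

from sentence_transformers import SentenceTransformer, util

# Sketch: encode whole sentences into vectors and compare them.
model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative public checkpoint
sentences = [
    "The weather is lovely today.",
    "It is sunny and warm outside.",
    "I left my laptop at the office.",
]
embeddings = model.encode(sentences)              # shape (3, 384) for this model
print(util.cos_sim(embeddings, embeddings))       # the first two sentences score highest together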

Applications of Vector Embeddings

Key Applications:

Retrieval in RAG

A critical component of any good RAG pipeline is the retrieval engine.

How it works:

Approaches for Ranking Text Chunks:

Contextualized Token Embeddings

In this lesson, you will learn:

Problem with Word Embeddings

Word embedding models like Word2Vec and GloVe don't understand context:

"The bat flew out of the cave at night."
"He swung the bat and hit the home run."

Using these models, both instances of "bat" would have the same vector embedding despite different meanings.
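To see the difference concretely, the sketch below pulls the contextualized vector for "bat" out of a standard BERT checkpoint for both sentences; the model name and the single-token pooling are illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModel

# Sketch: the same word gets different vectors from a contextual model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bat_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]              # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bat")]                             # vector at the "bat" position

v1 = bat_vector("The bat flew out of the cave at night.")
v2 = bat_vector("He swung the bat and hit the home run.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: context changes the vector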

Transformer Architecture

In 2017, the paper "Attention Is All You Need" introduced the transformer architecture to NLP.

The transformer architecture was originally designed for translation tasks and consists of two components: an encoder and a decoder.

Encoder output vectors are the contextualized vectors we're looking for.

BERT Model

BERT is an encoder-only transformer model heavily used in sentence embedding models.

BERT Specifications:
BERT base has 12 encoder layers, a hidden size of 768, 12 attention heads, and roughly 110 million parameters.

BERT Pre-training Tasks:
Masked Language Modeling (MLM): predict tokens that were randomly masked out of the input.
Next Sentence Prediction (NSP): predict whether two sentences appeared next to each other in the original text.

Token vs. Sentence Embedding

In this lesson, you will learn:

Tokenization in NLP

NLP systems deal with tokens, which can be whole words, subwords (word pieces), or individual characters.

Each sentence is represented by a sequence of integer values corresponding to tokens.

Token Embeddings in BERT

BERT has a vocabulary of about 30,000 tokens and an embedding dimension of 768.

How Token Embeddings Work in BERT:
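A hedged sketch of inspecting this in code with Hugging Face transformers; bert-base-uncased is an illustrative checkpoint whose WordPiece vocabulary (30,522 tokens) and 768-dimensional embeddings match the numbers above.

from transformers import AutoTokenizer, AutoModel

# Sketch: a sentence becomes integer token ids, and each id indexes a row
# of BERT's token embedding matrix (vocab_size x 768).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

ids = tokenizer("Embeddings capture meaning.")["input_ids"]
print(ids)                                   # the sentence as integer token ids
print(tokenizer.convert_ids_to_tokens(ids))  # the corresponding tokens, incl. [CLS]/[SEP]

print(model.get_input_embeddings().weight.shape)  # torch.Size([30522, 768])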

Creating Sentence Embeddings

After the success of word embeddings, researchers explored creating embedding vectors for sentences.

Initial (Failed) Approaches:

These approaches failed because they didn't properly capture the semantic meaning of the entire sentence.

Dual Encoder Architecture

Real progress in sentence embeddings came with the introduction of the dual encoder architecture.

Two Possible Goals for Sentence Encoders:
1. Sentence similarity: map sentences with similar meanings to nearby points in the vector space.
2. Question answering: map a question close to the passages that answer it.

These are not the same goal. For example, for the question "What is the tallest mountain in the world?", we want to retrieve the answer "Mount Everest is the tallest," not another sentence that merely restates the question.

The dual encoder architecture has two separate encoders (question encoder and answer encoder) and is trained using a contrastive loss.
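A deliberately tiny sketch of the idea, with plain linear layers standing in for the two BERT-based encoders and random tensors standing in for tokenized text; it only shows the shape of the computation, not a real model.

import torch
import torch.nn as nn

# Toy dual encoder: two *separate* encoders map questions and answers into the
# same vector space; a dot product scores every question against every answer.
torch.manual_seed(0)
dim_in, dim_emb, batch = 32, 16, 4

question_encoder = nn.Linear(dim_in, dim_emb)   # stand-in for a BERT question encoder
answer_encoder = nn.Linear(dim_in, dim_emb)     # stand-in for a separate answer encoder

questions = torch.randn(batch, dim_in)          # stand-ins for tokenized questions
answers = torch.randn(batch, dim_in)            # stand-ins for tokenized answers

q_emb = question_encoder(questions)             # (batch, dim_emb)
a_emb = answer_encoder(answers)                 # (batch, dim_emb)
print((q_emb @ a_emb.T).shape)                  # (batch, batch) similarity matrix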

Training a Dual Encoder

In this lesson, you will learn:

Dual Encoder Architecture:

Contrastive Loss

The idea behind contrastive loss is to pull matching pairs together in embedding space and push non-matching pairs apart, so that similar pairs receive high similarity scores and dissimilar pairs receive low ones.

In our context: each question embedding should be closer to the embedding of its matching answer than to the embeddings of every other answer in the batch.

In PyTorch, we can implement this with cross-entropy loss and a simple trick: set the target for row i of the batch similarity matrix to i (zero, one, two, and so on), indicating that the correct answer for each question is the one paired with it, i.e., the diagonal of the matrix.
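A minimal sketch of that trick, with random tensors standing in for a batch of question and answer embeddings:

import torch
import torch.nn.functional as F

# sim[i, j] = similarity of question i with answer j; the correct answer for
# question i is answer i, so the target for row i is simply i (the diagonal).
q_emb = torch.randn(8, 128)              # stand-in question embeddings (batch=8)
a_emb = torch.randn(8, 128)              # stand-in answer embeddings for the same batch

sim = q_emb @ a_emb.T                    # (8, 8) similarity matrix
targets = torch.arange(sim.size(0))      # tensor([0, 1, ..., 7])
loss = F.cross_entropy(sim, targets)     # rewards high scores on the diagonal
print(loss.item())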

Building the Encoder

Encoder Components:

The final output is a contextualized embedding that can be used for similarity comparisons.
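One way such an encoder might be put together is sketched below: a BERT backbone, [CLS]-token pooling, and a projection layer. The pooling choice, projection size, and checkpoint name are assumptions for illustration, not necessarily the course's exact design.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class Encoder(nn.Module):
    """Sketch of a BERT-based encoder: backbone + [CLS] pooling + projection."""

    def __init__(self, model_name: str = "bert-base-uncased", emb_dim: int = 512):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size, emb_dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]     # contextualized [CLS] vector
        return self.projection(cls)           # final sentence embedding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = Encoder()
batch = tokenizer(["What is the tallest mountain?"], return_tensors="pt")
with torch.no_grad():
    print(encoder(batch["input_ids"], batch["attention_mask"]).shape)  # (1, 512)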

Training Loop

Training Process:
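A condensed sketch of how the pieces above fit together in a training loop; the Encoder class from the previous sketch, the dataloader of tokenized question/answer batches, and the hyperparameters are all assumptions for illustration.

import torch
import torch.nn.functional as F

# Sketch of a dual-encoder training loop (assumes the Encoder sketch above and a
# dataloader yielding tokenized question/answer batches).
question_encoder = Encoder()
answer_encoder = Encoder()
optimizer = torch.optim.Adam(
    list(question_encoder.parameters()) + list(answer_encoder.parameters()), lr=1e-5
)

for q_batch, a_batch in dataloader:          # hypothetical dataloader of (question, answer) pairs
    q_emb = question_encoder(q_batch["input_ids"], q_batch["attention_mask"])
    a_emb = answer_encoder(a_batch["input_ids"], a_batch["attention_mask"])
    sim = q_emb @ a_emb.T                    # in-batch (batch, batch) similarity matrix
    targets = torch.arange(sim.size(0))      # correct answers sit on the diagonal
    loss = F.cross_entropy(sim, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()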

Using Embeddings in RAG

In this lesson, you will learn:

RAG Pipeline with Dual Encoder:

Approximate Nearest Neighbors

Finding matching chunks by computing the similarity between the question embedding and every answer embedding is computationally expensive when the corpus is large.

Instead, we use Approximate Nearest Neighbors (ANN) algorithms:

These algorithms approximate nearest neighbor searches with high accuracy but significantly lower compute time.
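As one concrete, hedged example, the sketch below builds an HNSW index with the FAISS library over random stand-in answer embeddings; FAISS and the HNSW parameters are illustrative choices rather than the course's specific tooling.

import numpy as np
import faiss

# Sketch: approximate nearest-neighbor search over answer embeddings with FAISS.
dim = 768
answer_embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in corpus vectors

index = faiss.IndexHNSWFlat(dim, 32)      # HNSW graph index, 32 links per node
index.add(answer_embeddings)              # build the index once, offline

question_embedding = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(question_embedding, 5)   # top-5 approximate matches
print(ids)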

For large datasets, implement ANN using a persistent data store on disk.

Full RAG Pipeline

RAG Implementation Options:

Full Pipeline Flow:
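The sketch below strings the stages together; question_encoder, index, chunks, and llm_generate are hypothetical stand-ins for the components discussed above, not a specific library's API.

# Hedged end-to-end sketch of the RAG flow. All names below (question_encoder,
# index, chunks, llm_generate) are hypothetical stand-ins, not real APIs.
def answer_question(question: str, k: int = 5) -> str:
    q_emb = question_encoder.encode(question)        # 1. embed the user question
    _, ids = index.search(q_emb.reshape(1, -1), k)   # 2. ANN lookup of the top-k chunks
    retrieved = [chunks[i] for i in ids[0]]          # 3. fetch the matching text chunks
    prompt = (
        "Answer the question using only the facts below.\n\n"
        + "\n".join(retrieved)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm_generate(prompt)                      # 4. let the LLM compose the answer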

Conclusion

In this course, you learned about:

Two-Stage Retrieval Pipeline

A common practical approach is two-stage retrieval (Retrieve and Rerank): a fast first stage (for example, a dual encoder with an ANN index) retrieves a set of candidate chunks, and a slower but more accurate second stage reranks those candidates before they are passed to the LLM.
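A hedged sketch of the second (rerank) stage using a cross-encoder from sentence-transformers; the checkpoint name is an illustrative public model, and the candidate chunks stand in for whatever the first-stage retriever returned.

from sentence_transformers import CrossEncoder

# Sketch: rerank first-stage candidates with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative checkpoint

query = "What is the tallest mountain in the world?"
candidates = [  # e.g., top chunks returned by the ANN retriever
    "Mount Everest is the tallest mountain above sea level.",
    "K2 is the second-highest mountain on Earth.",
    "Mountains form through tectonic plate collisions.",
]
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])   # the most relevant chunk after reranking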

Additional Retrieval Techniques

While embedding models are essential for RAG, other retrieval techniques can complement neural search:

These techniques help ensure that the facts passed to the LLM are the most appropriate for responding to the user query.

Thank you for joining us to learn about sentence embeddings!
